[57] ABSTRACT

A method and apparatus for correlating faults in a networking
system. A database of fault rules is maintained, along
with associated probable causes and possible solutions,
for determining the occurrence of faults defined by the fault
rules. The fault rules include a fault identifier, an occurrence
threshold specifying a minimum number of occurrences of
fault events in the networking system in order to identify the
fault, and a time threshold in which the occurrences of the
fault events must occur in order to correlate the fault.
Occurrences of fault events in the networking system are 
detected and correlated by determining matched fault rules 
which match the fault events and generating a fault report 
upon determining that a number of occurrences for the 
matched fault rules within the time threshold is greater than 
or equal to the occurrence threshold for the matched fault 
rules. 
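
The threshold test described in the abstract can be sketched as follows. This is only an illustrative sketch, not the implementation of the described embodiments: the names `FaultRule`, `Correlator`, and `on_event`, the use of seconds for the time threshold, and the assumption that events arrive in non-decreasing timestamp order are all hypothetical choices made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRule:
    """Hypothetical record holding the rule fields named in the abstract."""
    fault_id: str              # fault identifier matched against incoming events
    occurrence_threshold: int  # minimum number of events needed to identify the fault
    time_threshold: float      # window (assumed to be seconds) in which they must occur
    probable_cause: str = ""
    possible_solution: str = ""


class Correlator:
    """Counts fault events per rule inside a sliding time window."""

    def __init__(self, rules):
        self.rules = {rule.fault_id: rule for rule in rules}
        self.history = defaultdict(deque)  # fault_id -> timestamps still in the window

    def on_event(self, fault_id, timestamp):
        """Record one fault event; return a fault report dict, or None."""
        rule = self.rules.get(fault_id)
        if rule is None:
            return None  # no fault rule matches this event
        times = self.history[fault_id]
        times.append(timestamp)
        # Discard events that fall outside the rule's time threshold.
        while timestamp - times[0] > rule.time_threshold:
            times.popleft()
        # Report when occurrences within the window >= occurrence threshold.
        if len(times) >= rule.occurrence_threshold:
            return {
                "fault_id": rule.fault_id,
                "occurrences": len(times),
                "probable_cause": rule.probable_cause,
                "possible_solution": rule.possible_solution,
            }
        return None
```

With a rule requiring three events within ten seconds, events at 0, 4, and 9 seconds produce a report on the third event, whereas events at 0, 6, and 12 seconds do not, because the first event has aged out of the window by the time the third arrives.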
NETWORK FAULT CORRELATION

This is a continuation of application No. 08/626,647, filed Apr. 1, 1996, now abandoned, which is a continuation of application No. 08/337,085, filed Nov. 10, 1994, now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to networking systems. More specifically, the present invention relates to a correlation method and apparatus for correlating network faults over time in order to perform network fault management in a networking system, and monitor the health of devices in the network.

2. Background Information

As networking systems proliferate, and the problems involved in configuring and maintaining such networks also increase, network management becomes an increasingly complex and time-consuming task. One of the primary considerations in managing networks, especially as their size increases, is fault management. Very simple types of fault management may be performed by determining when communication links and devices in the system fail, then, some sort of corrective and/or diagnostic measures may be taken such as manual connection and reconnection of physical links, and testing and diagnosis of network devices and/or end stations.

More sophisticated techniques for fault detection and diagnosis include receiving traps or other types of alert signals from devices within the network. As faults are detected, devices can alert a centralized device such as a computer system or other networking system management console, that such faults have occurred. These prior art techniques have suffered from some shortcomings, however. First, typical prior art fault detection and diagnostic systems include centralized consoles which receive and record fault alert signals or traps as they occur. Management tools which provide such diagnostic capability frequently rate faults received from units in the networking system according to their severity. Unless a certain number of traps are received of a particular type, according to predefined rules, then no action is taken upon the traps.

A fundamental problem with these pre-existing systems is that because functionality is concentrated in a single device in the network, networks errors at various devices in the network may not be able to be detected. Moreover, these errors may occur in such a volume that actual network errors may be obscured. In fact, some errors may be lost due to the large volume of faults at the single device. Because a large amount of faults may be generated which do not indicate any specific problems in the network (e.g., transient faults), errors indicating actual severe faults actually requiring action may go unnoticed.

Yet another shortcoming of certain prior art systems includes the ability to determine whether the detected faults are indicative of a one specific problem identified by the fault type, rather than a symptom of a different problem. Multiple faults of a specified fault type may need to be detected in order for a one particular problem type to be identified. Thus, individual faults which are detected are simply "raw" error data and don't necessarily indicate an actual problem. These may, given certain circumstances, indicate a specific problem, and current art fails to adequately address the correlate multiple faults over time intervals to identify specific problems.

Another fundamental shortcoming of prior art network diagnostic techniques is that such prior art techniques typically rely upon a single count of a number of errors of a particular type occurring. This technique, known as "filtering", has fundamental shortcomings in that it does not provide for other types of measurement of faults such as time such faults are occurring, number of faults within a given time period, or other more sophisticated approaches. Moreover, some prior art diagnostic systems only provide records of faults, but do not, based upon other measured fault characteristics, attempt to determine a possible reason for a fault or group of faults, and moreover, do not offer any practical solutions to a network manager or other user.

Other prior art solutions to network fault management include displaying the status of network devices in a manner which allows, at a glance, to determine whether a given device is functioning or not. These solutions include displaying color-coded iconic representations of devices on a [illegible] of the networking system. This solution fails to take into account intermittent faults of links and/or devices which may only occur a single time, the time of the poll, or [illegible] and which do not necessarily pose any [illegible] solutions show network health in this manner using user-defined state machines which are used by the management console. Both of these solutions usually rely upon simple displays of representations of individual devices in the network rather than displays at various levels of abstraction in the system, including, the port, slot, chassis and device level. In addition, none of these prior art solutions use network topology information in order to determine how network health changes for related devices causes corresponding changes in each of the related devices' health.

Thus, the prior art of network fault detection and network health monitoring has several shortcomings.

SUMMARY AND OBJECTS OF THE PRESENT INVENTION

One of the objects of the present invention is to provide a network fault diagnostic system which does not rely upon a single device for detecting faults.

Another of the objects of the present invention is to provide a diagnostic system in a networking which uses a number of variables for determining whether faults of a certain type have occurred in the system.

Yet another of the objects of the present invention is to provide a system which, when given certain criteria, provides probable causes and possible solutions for such faults in a networking system.

These and other objects of the present invention are provided for by a computer-implemented method for correlating faults in a networking system. The method establishes a database of fault rules, and associated probable causes, and possible solutions for determining the occurrence of faults defined by the fault rules. The fault rules include a fault identifier, a description of the fault, a possible cause for the fault, a probable solution for the fault, an occurrence threshold specifying a minimum number of occurrences of fault events in the networking system in order to identify the fault, a time threshold in which the occurrences of the fault events must occur in order to correlate the fault, and a severity value indicating the severity of the fault. Occurrences of fault events in the networking system are detected and correlated by determining matched fault rules which match the fault events and generating a fault report upon deter-
mining that a number of occurrences for the matched fault rules within the time threshold is greater than or equal to the occurrence threshold for the matched fault rules. The descriptions, the probable causes, the possible solutions and the severity value for each of the fault reports may then be displayed for diagnosis of the faults in the networking system.

The correlating of the occurrences of the fault events may be performed by a plurality of distributed network devices each responsible for correlating fault events generated by a subset of devices in the networking system. A management device may then record the fault reports generated by each of the plurality of distributed network devices, and present the descriptions, the probable causes, the possible solutions and the severity value for each of the fault reports. In this way, the load of correlating faults is distributed, and an efficient level of abstraction for examining the traps in the network may be maintained.

In implemented embodiments, fault rules further may include an escalation threshold and escalation reports are generated upon determining that a previous fault report has been generated for each of the matched fault rules, then determining whether the number of fault event occurrences since the time threshold is greater than or equal to the escalation threshold. Certain of the fault rules may also be designated as toggle rules which are associated with at least two types of fault events. In this instance, the step of correlating the occurrences of the fault events further comprises generating a toggle fault report if the fault events are either of the at least two types of faults and current states of devices to which the fault events pertain are different than states indicated by previous fault reports. A set of state rules for the devices in the networking system may also be established wherein the fault correlators determine which state rules match the fault events, and issue state change reports in order to record states of the devices in the networking system. State changes are determined based upon the interrelationship of network objects, as determined using stored network topology information. By means of state change operators (e.g. "increment" and "set") and associated severity values, network object operation may be monitored and displayed. These state changes may also be propagated to other objects which include the affected objects.

These and other objects of the present invention are provided for by a fault correlation apparatus for use in a networking system which includes a plurality of fault correlators for coupling to the networking system, wherein each of the plurality of fault correlators is responsible for correlating fault events from a subset of devices in the networking system. Each of the plurality of fault correlators includes a plurality of fault correlation rules each including a rule identifier, a time threshold, and a number of occurrences, a criteria matcher for determining whether the number of occurrences of fault events having identifiers equal to the rule identifier has occurred within the time threshold for each of the plurality of fault rules, and a fault report generator for generating fault reports upon activation of the criteria matcher. A fault recorder is coupled to the fault correlators for recording the fault reports and a fault monitor is coupled to the fault recorder for displaying the fault reports associated with possible causes and/or solutions for faults represented by the fault reports. The fault recorder and the fault monitor comprise a network management station for use by a network manager.

Other objects, features and advantages of the present invention will be apparent from the accompanying description and figures which follow below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate like elements and in which:

FIG. 1 illustrates a block diagram of devices in a networking system in which embodiments of present invention may be implemented.

FIG. 2 shows a block diagram of a single network device which may be used for implementing some of the described techniques.

FIG. 3 illustrates a [illegible] of the processes within a network management station (NMS), and a network control engine (NCE) in embodiments of the present invention.

FIG. 4 illustrates a detailed view of a correlator used in implemented embodiments.

FIG. 5 illustrates a state update message used in implemented embodiments.

FIG. 6 [illegible] illustrating how state changes are determined for individual network devices.

FIGS. 7a and 7b illustrate a process for determining the severity level for a state update message is determined.

FIGS. 8 and 9 illustrate trap objects and raw trap records which are used for correlating traps.

FIG. 10 shows the structure of a fault correlation rule in implemented embodiments.

FIG. 11 shows a structure of a meta-trap object which is generated after correlating trap objects and a fault rule has been met.

FIG. 12 illustrates the operation of a fault correlation reduction rule.

FIG. 13 illustrates the operation of a fault correlation toggle rule.

FIGS. 14a and 14b are flowcharts of a method for correlating traps.

FIG. 15 [illegible] for determining whether the [illegible] has been met.

FIG. 16 illustrates a process which is performed for generating summary meta traps.

FIG. 17 illustrates the representation of a record for a state rule.

FIG. 18 shows a record which may be used for storing statistical information regarding correlated traps by a single correlator.

FIG. 19 shows the user interface which may be used for displaying current faults in a networking system.

FIG. 20 shows an example of a user interface for specifying [illegible] display of faults.

FIG. 21 illustrates a user interface for [illegible] the data [illegible] and/or a possible solution for the fault.

FIG. 22 illustrates a statistics user interface which may be used for displaying the number and distribution of faults for a particular correlator in a networking system.

FIG. 23 shows a network health summary user interface for displaying the state of devices in the network.

FIG. 24 shows a user interface for displaying the distribution of state changes for a particular network device.

DETAILED DESCRIPTION

The present invention is related to a correlation system and method for use within a networking system. Traps and
solicited responses are correlated over time for determining faults in the networking system in order to determine probable faults, causes, and possible solutions to those faults in the networking system. This system may be distributed among a number of correlators in a networking system, making it especially useful for very large networking systems which require fault correlation at different levels of abstraction, and which generate a high amount of fault event traffic in some instances, actually indicating the presence of faults in the network. Although the present invention will be described with reference to certain specific embodiments, including specific data structures, types of communication media, networking systems, etc., it can be appreciated by one skilled in the art that these are for illustrative purposes only and are not to be construed as limiting the present invention. Other departures, modifications, and other changes may be made, by one skilled in the art, without departing from the teaching of the present invention.

An example of a configuration of a system implementing the methods and apparatus of the preferred embodiment is illustrated as system 100 in FIG. 1. The methods and apparatus to be described here are implemented in a network computing engine (NCE) 101. In other embodiments, the methods and apparatus to be described here may be implemented in a general purpose workstation which is coupled to a network. In one embodiment, NCE 101 is coupled to the backplane of a network management module (NMM) 103 as shown in FIG. 1. Network computing engine 101 further may be coupled to a network management console 102 for communicating various control information to a system manager or other user of network management console 102. As is illustrated, NMM's or concentrators as is described in U.S. Pat. No. 5,226,120 of Brown et al., (hereinafter "Brown") issued on Jul. 6, 1993, may be hierarchically linked in a variety of configurations, such as that shown in system 100 of FIG. 1. A more detailed description of network computing engine 101 will be described with reference to FIG. 2.

Referring to FIG. 2, a system 210 upon which one embodiment of a network computing engine (NCE, e.g., 101 of FIG. 1) of the present invention is implemented is shown. 210 comprises a bus or other communication means 201 for communicating information, and a processing means 202 coupled with bus 201 for processing information. System 210 further comprises a random access memory (RAM) or other volatile storage device 204 (referred to as main memory), coupled to bus 201 for storing information and instructions to be executed by processor 202. Main memory 204 also may be used for storing temporary variables or other intermediate information during execution of instructions by processor 152. System 210 also comprises a read only memory (ROM) and/or other static storage device 206 coupled to bus 201 for storing static information and instructions for processor 202, and a data storage device 207 such as a magnetic disk or optical disk and its corresponding disk drive. Data storage device 207 is coupled to bus 201 for storing information and instructions. System 210 may further be coupled to a console 211, such as a cathode ray tube (CRT) or liquid crystal display (LCD) or teletype coupled to bus 201 for displaying information to a computer user. In the implemented embodiments, another device which is coupled to bus 201 is a communication device 215 which is a means for communicating with other devices (e.g., NMM's or a network management console, e.g., 202 of FIG. 2). This communication device may also include a means for communicating with other nodes in the network. In implemented embodiments, this may include an Ethernet standard interface coupled to a CSMA/CD backplane for communicating network information with the NMM 213 and other NMM's in system 210. Note, also, that any or all of the components of system 210 and associated hardware may be used in various embodiments, however, it can be appreciated that any configuration of the system that includes a processor 202 and a communication device 215 may be used for various purposes according to the particular implementation.

In one embodiment, system 210 is one of the Sun Microsystems® brand family of workstations such as the SPARCstation brand workstation manufactured by Sun Microsystems® of Mountain View, Calif. Processor 202 may be one of the SPARC brand microprocessors available from SPARC International, Inc. of Mountain View, Calif.

Note that the following discussion of various embodiments discussed herein will refer specifically to a series of routines which are generated in a high-level programming language (e.g., the C++ language) and compiled, linked, and then run as object code in system 210 during run-time, for example by the SPARCompiler available from Sun Microsystems® of Mountain View, Calif. It can be appreciated by one skilled in the art, however, that the following methods and apparatus may be implemented in special purpose hardware devices, such as discrete logic devices, large scale integrated circuits (LSI's), application-specific integrated circuits (ASIC's) or other specialized hardware. The description here has equal application to apparatus having similar function.

FIG. 3 illustrates the processes which are operative within devices in the networking system and devices of FIGS. 1 and 2. Network devices (e.g., [illegible]) 311 in the networking system communicate with a process known as the trap server 312 in a single device in the networking system, such as a network control engine, or other device in the network, having a super agent host process 310. Communication is performed in four ways: 1) traps being transmitted from SA host 310 to the NMS 320 upon detection of certain condition(s); 2) the NMS 320 polling health data from the correlator [illegible] exchanges; 3) [illegible] polling devices to determine functionality by the super ping [illegible] communicates via traps with the trap server [illegible] correlator [illegible] to determine status.

Trap server 312 in SA host 310 serves buffering, storage, and other functions in the device until correlated by fault correlator 313. The trap server is a process that can be active on a distributed NCE or on the NMS console. Its function is to listen to SNMP traps which are received through port 162 which it binds to, in implemented embodiments, and forward them to any client application of the trap server 312. Fault correlator 313 is one such client of trap server process 312.

Fault correlator 313 collects data by listening to incoming traps, polling devices for data via either RM server process 314, directly "pinging" devices (prompting devices to determine their reachability) via super ping process 315, or polling a device to directly determine device status. In addition, it correlates traps not already correlated by another correlator in the networking system. Every time a trap is generated by an NMM or other SNMP agent (device or application) in the system or fault data is polled from an NMM or other SNMP agent, the correlator invokes a current rule base to evaluate new events in the system in relation to known events. For the purposes of the remainder of this application, fault events or simply events, consist of traps or polled network data (e.g. SNMP "Get" exchanges) respon-
sive to certain network conditions. As each fault event is evaluated, a record of it is maintained, including such information as whether a similar event has been correlated before and thereby used to generate a "meta" trap. Meta traps will be discussed in more detail below, however, meta traps are generated by the fault correlator 313 upon the satisfaction of a certain number of conditions. These may include, but are not limited to the occurrence of a certain number of events within a specified time period. This will be generally referred to as a rule and will be described in more detail below. Once a number of traps within a specified time period have occurred, the generated meta traps are sent to a [illegible], typically, a network management station (NMS) 320. The fault correlator 313 and super ping process 315 both use a domain database to determine the device(s) that they are responsible for correlating. Super ping process 315 generates a trap whenever device reachability status changes. These traps may also be correlated by a single or multiple correlators 313.

In addition to the meta traps describing faults, the correlator uses a rule base to derive the state (health) of network devices based on the traps or the data polled from the NMM's/SNMP agents. Every time the inferred or determined state of a device changes, according to defined rules, a state change meta trap is sent to the trap server 324 in the NMS 320 and then forwarded to the state recorder 326.

Meta traps are received by the NMS 320 via the trap server 324. The trap server 324 packages the traps and passes them to the fault recorder 323 which maintains a long term record of meta traps received. In this manner, a record is kept of long term faults in the system and diagnosed, if required. The recorded faults are then put into a fault database 322 which may be accessed by fault summary process 321 for presentation of long term fault information to a network manager or other user of the networking system. It may also suggest probable causes for the fault, and/or possible solutions to correct the identified faults. This information is sent from the fault correlator 313 in the meta trap. Meta traps may also be recorded in an NM platform event log 325 for retrieval at a later date or other post-processing.

In an alternative embodiment, meta traps can be received by the NM platform event log and be post-processed by trap server 324. In either embodiment, a log is maintained of meta traps and the traps are serviced via presentation to the user of the NMS 320 information regarding the meta traps such as reporting device(s) and/or processes, probable causes, possible solution(s), and/or severity levels of the faults, according to implementation. Other processes which are operative within the network management station include the network health monitor 328 which extracts state information from the network database 327 which maintains a record of all network devices and their current status. The network health monitor 328 can then display the device(s) (logical, physical or containers) in color-coded form according to their status. The database is maintained by state recorder 326 as illustrated in FIG. 3 which records state changes received from the distributed correlators. Another set of features provided by implemented embodiments of the present invention is the action request system (ARS) trouble ticketing system available from Remedy Corporation, of Mountain View, Calif. The fault correlator 313 also provides faults in the form of trouble tickets to an ARS gateway 331 which provides this information to a second process via ARS server 330. Upon receiving meta traps, the ARS gateway 331 generates ARS trouble tickets and issues them to ARS servers such as 330. The ARS server may be viewed from any ARS client connected to the server such as 326 illustrated in FIG. 3. In addition, the long term meta traps generated by different correlators may also be maintained in the fault database and viewed by the network manager or other user during accessing by fault summary process 321.

[illegible] indicating discrete fault events are [illegible] correlated, a correlation algorithm is [illegible] a meta [illegible]. Correlation may be performed, but is not necessarily limited to being performed, upon each [illegible]. In one embodiment, the [illegible] 313 the list of previously detected traps is scanned and each type is evaluated to see whether a [illegible] according to pre-stored fault correlation [illegible] a meta [illegible] created and transmitted to the [illegible] recorder 323 in network management station 320. A correlator may communicate fault and state information via meta traps to more than one network management station 320 which are specified in a list maintained in the correlator and which is [illegible]-configurable.

A detail of a distributed fault correlator is shown in FIG. 4. The correlator is comprised of two main components: the correlation engine 410 and the state engine 420. The correlator operates according to fault rules and state rules that control the two engines respectively. Fault correlation engine 410 is driven by events received from network [illegible] or other application programs in the network 311. The correlation engine 410 may also perform active polling. [illegible] active [illegible] may also be performed to determine the [illegible] of device(s). [illegible] active [illegible] may be implemented [illegible] fault [illegible] "SNMP Get" exchanges with devices and/or applications in the network 311. [illegible] event is interpreted according to the fault rules, state rules, and the current knowledge about the objects to which the correlation [illegible] event can [illegible] of the following:

1. sending of a fault report (meta trap) to the Network Management Station 320;

2. sending of a state update to the state engine; or

3. scheduling of an "SNMP Get" request to determine whether a fault condition is clear.

The state engine 420 is driven by state update messages received from fault correlation engine 410. Each message refers to one network object (i.e., a device or application). The state of the network object is modified according to the message parameters, and propagated up the physical and logical network hierarchies. Certain objects in the hierarchies refers to objects on a particular device (e.g. a port or slot). Thus, state changes on those objects result in state changes on the affected device(s). Other objects in the hierarchy refers to device(s) and/or groups of devices (e.g. all those devices in a particular building, or in a particular segment). The device hierarchies at this level may be maintained by a topology recording process in order that state changes effecting particular devices also effect the objects representing groups of devices (e.g. a building). This topology recording process may accept manual changes to the topology (e.g. as entered by a network manager) or may be determined automatically, in any number of ways. Thus, this network model is maintained by network object, according to each object's level in the hierarchy, and used to propagate changes to other network objects higher in the hierarchy (e.g. port or slot to a single device, a single device to other devices/groupings of devices).

The state engine uses an aging scheme to gradually clear certain state conditions. If the severity of an object state
changes from low to medium, from medium to high, or any
other state change in severity level, a state change
notification is sent to the network management station 320 via a
meta trap. Thus, in addition to network faults which will
cause fault meta traps to be generated, the correlation engine
will also generate state update messages which are sent to
the state engine 420 in order to update the current status of
network device(s) and/or application(s).

The state of an object is defined by a vector of severity
levels assigned to the following fault categories: connectivity;
error rates; components condition; load; configuration;
and security. The severity level of each category is an integer
value from 0, indicating a fully-functional or operational
condition, to 10, meaning non-functional. Levels of 1-3 are
of low severity, 4-6 are medium, and 7-10 are high severity.
Some states are permanent or long-term in nature (for
example, a fan failure in a network device) while other
changes reflect a temporary or short-term problem (for
example, the saturation of a network device). When an event
of the permanent type occurs, the state condition is cleared
only on evidence that the problem has been resolved. State
changes of the temporary type will stay in effect for a certain
period of time and then be decayed. The state engine is fed
by state update messages generated by the correlation engine
410. Update messages have the format as illustrated in 500
of FIG. 5, and contain several parameters: A message
identification field 501 which can be used to update or undo
the effect of this state change. Subsequent state changes will
have the same message ID. An object identifier field 502
which uniquely identifies a physical or logical object in the
network. The state category field 503 indicates one of the
state categories as referenced above: connectivity;
error rates; components condition; load; configuration; or
security. The severity level field 504 contains the severity
level, an integer between 0 and 10. The update mode field
505 is a Boolean value indicating either "set" or "increment."
This field specifies how the given severity should be
used to compute the new severity level. Typically, a "set"
change is used for long-term problems while short-term
problems require an "increment" update to reflect an
accumulated effect. "Set" is an absolute change in state, while an
increment is a value indicating an addition to the current
severity state. The next field is a maximum limit field 506
which is used in the case of the "increment" update. This
indicates a maximum limit for the determined severity level.
In "set" type messages, the maximum limit field is null or 0.
Finally, an aging interval field 507 is used for indicating the
time, in seconds, after which the effect of this update should
be undone. This is applicable only for increment types of updates.
Again, for "set" types of updates, the field will be undefined.

The computation of network object states is done using a
neural network model that is comprised of nodes and arcs
connecting them. Nodes represent network objects. Arcs
represent state updates. All arcs are uni-directional, forming
a tree structure. Each node has outgoing arcs connected to its
physical and logical container objects through which states
are propagated. Leaf nodes are the lowest objects in the network
physical hierarchy, typically ports or interfaces on network
devices. Every state update message from the correlation
engine is translated into an arc entering the leaf node that
represents the relevant object. The hierarchy may be
maintained internally in the correlator using data structures
well-known to those skilled in the art. These arcs have a
limited life-span, that is, they are removed once the
condition is clear. FIG. 6 illustrates an example of a network
model used for state updates.

As illustrated in FIG. 6, traps are sent to the correlation
engine 410 as shown in FIG. 4. The correlation engine then
associates each of these traps with individual objects in the
network physical hierarchy such as the ports 620 and
621. The state changes to the two ports result in state
changes to the logical entities represented by the dotted arcs,
and the physical entities represented by the solid arcs. That
is, state changes are then generated for segments 630 and
632 (e.g. VLANs in the networking system) shown in FIG.
6. In addition, state changes occur for slot 622 as
illustrated in the figure. Finally, because the slot 622 also
results in state changes for another network object, the
chassis, a state change message is also generated for that
device. Any other network object affected by the traps
is changed in the same way.

The current severity of a state category of a leaf is
computed in the manner illustrated with reference to
FIGS. 7a and 7b, and is propagated up through each device
(logical or physical) in the hierarchy. Process 700 is
performed for each object, and starts at a typical process entry
point as illustrated in FIG. 7a. At step 702, all increment arcs
for the object are sorted according to their maximum severity
for any given increment, an integer value between 1 and
10. Maximum severity limits for any given increment are
established in a configuration file as preset by a network
manager or a user and is stored in the names as discussed in
FIG. 5 above. Once the increment arcs are divided into these
10 groups, then a sum of the severities for the increment arcs
of each group is determined at step 704. For each group, the
lower of the increment sum and the maximum
severity for the group is then determined at step 706. At step
707, the highest of the 10 group values, as determined at
step 706, is then used as the accumulated increment sum to
be used for further processing in conjunction with the "set"
arcs. Then, process 700 continues on FIG. 7b for processing
of the set arcs and determination of the new severity level.

The portions of process 700 shown in FIG. 7b account for
all the "set" arcs for any given device. At step 708, it is
determined whether there are any more "set" arcs (state
changes) for the device. If so, then it is determined whether
the currently examined set arc at step 709 is the highest
examined yet. If so, then this set arc is used as the highest
set arc at step 710. If not, then the highest set arc is not
adjusted and the process returns to step 708 to determine
whether there are any more set arcs for the device. Once
there are no more set arcs for the device as detected at step
708, it is determined whether the accumulated increment
sum exceeds the highest set arc at step 711. If so, then the
severity for the device is set equal to the accumulated
increment severity at step 712. If not, then the state change
for the adjustment in severity is set to the highest set arc at
step 713. In either event, upon completion of steps 712 or
713, the device then has its state changed to the appropriate
severity level, via the transmission of a "set" state change
trap or message to the NMS 320 from state engine 420. The
state change is then recorded by the state recorder 326, used
to update the network database 327, and is eventually
displayed upon the network health monitor 328.

The fault correlation engine 410 is implemented using an
enhancement of the C Programming Language known as
Rule extended Algorithmic Language (RAL) which is
available from Production System Technologies. RAL is an
extension of C that incorporates a rule evaluation algorithm
for evaluating fault events and, if necessary, generating fault
reports (meta traps). Upon invocation the correlator registers
with the trap server to receive all not previously correlated
traps. Each time a trap is received, it is parsed and converted
into an RAL object instance. The correlator can then scan its
set of rules to determine if the trap object matches any of the
specified fault rules, based upon the trap ID, the originator
of the trap, and the trap attributes. This is accomplished by 
pushing the rule set onto the current context and invoking 
RAL's forward driven evaluation engine. 
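
The severity computation of process 700 (FIGS. 7a and 7b) described above can be sketched as follows. This is a minimal illustration in Python rather than RAL or C; the function names and the representation of arcs as plain tuples and integers are assumptions made for the example, not part of the described embodiment.

```python
from collections import defaultdict

def accumulated_increment(increment_arcs):
    """increment_arcs: iterable of (severity, max_limit) pairs (hypothetical layout)."""
    sums = defaultdict(int)
    # Steps 702/704: group the increment arcs by maximum limit and sum each group.
    for severity, max_limit in increment_arcs:
        sums[max_limit] += severity
    # Step 706: cap each group's sum at its limit. Step 707: keep the highest value.
    return max((min(total, limit) for limit, total in sums.items()), default=0)

def new_severity(increment_arcs, set_arcs):
    """set_arcs: iterable of absolute severity levels from "set" updates."""
    highest_set = max(set_arcs, default=0)  # steps 708-710
    # Steps 711-713: the larger of the accumulated increment and the highest set arc.
    return max(accumulated_increment(increment_arcs), highest_set)
```

For instance, seven "increment" arcs of severity 1 with a maximum limit of 6 accumulate to 6 rather than 7, and a single "set" arc of severity 8 on the same object would then raise the result to 8.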

The fault correlator of implemented embodiments of the 
present invention currently supports two types of fault rules. 
The first type is known as a reduction rule that correlates a 
large number of traps of a similar type which originated 
from the same device into a single event. This process may 
then generate a meta trap. Rules of this type have the 
following format: 

if a <trap-type> has occurred more than 
<number-of-occurrences> 
within the last <time-interval> => report a fault 
For example, if a particular type of trap has occurred more 
than five times within the last fifteen minutes, then a meta 
trap or fault report is generated. 
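
A reduction rule of this form can be sketched as a sliding time window over trap arrival times. The class and method names below are hypothetical, and the strict "more than" comparison follows the rule format shown above; other passages describe the threshold as a minimum count, so an implementation might use "greater than or equal" instead.

```python
from collections import deque

class ReductionRule:
    """Sliding-window counter for one trap type from one device (illustrative only)."""

    def __init__(self, occurrences, interval):
        self.occurrences = occurrences  # <number-of-occurrences>
        self.interval = interval        # <time-interval>, in seconds
        self.times = deque()            # arrival times still inside the window

    def record(self, now):
        """Record a trap at time `now`; return True if a fault should be reported."""
        self.times.append(now)
        # Drop arrivals that have fallen out of the time interval.
        while self.times and now - self.times[0] > self.interval:
            self.times.popleft()
        return len(self.times) > self.occurrences
```

With `ReductionRule(5, 900)`, the sixth trap arriving inside fifteen minutes is the first call for which `record` returns True.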

The second group of fault rules supported and
implemented in embodiments of the present invention is
known as "toggle" rules. Toggle rules are those
that correlate at least two different trap or event types, each of
which causes a fault report to be generated to indicate the change
in condition according to the last trap/event received. For
example, if a communication link goes down a "link down" 
trap is received and the state of the link is "down." Subse- 
quent "link down" messages are ignored because they do not 
result in a change in state for the device. If a "link up" trap 
is subsequently received, however, for the same device, the 
rule is thus toggled to the other ("up") condition, and a meta 
trap is generated indicating the change in state. 
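
The toggle behaviour described above amounts to remembering the last reported state per device and reporting only on a change. The sketch below assumes a default state of "up" and uses hypothetical names; the returned dictionary merely stands in for a meta trap.

```python
class ToggleRule:
    """Tracks the last known state per device and reports only changes (illustrative)."""

    def __init__(self, default_state="up"):
        self.default_state = default_state
        self.state = {}  # device identifier -> last reported state

    def on_trap(self, device, new_state):
        """Return a report dict on a state change, or None if the trap is redundant."""
        old_state = self.state.get(device, self.default_state)
        if new_state == old_state:
            return None  # e.g. a repeated "link down" trap is ignored
        self.state[device] = new_state
        return {"device": device, "state": new_state}
```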

FIGS. 8 and 9 show the format of records which are used 
for processing raw trap objects. FIG. 8 illustrates 800 which 
is a raw trap object which is received from a device in the 
system, for example one of the NMM's in the network 311 
shown in FIG. 3. 800 is a record which has a format which 
is generated in conformance with SNMP traps as specified 
in the networking industry standard document RFC 1215. 
800 comprises a type field 801 which indicates the type of 
raw trap which has been detected within the network. As trap 
objects are received, they are maintained in a list for 
evaluation by the fault correlator 313 according to fault 
rules, which are specified by rule number. The specific 
definition of fault rules including the rule numbers is dis- 
cussed in more detail below. In addition to the trap type, the 
trap object further comprises a time field 802 indicating the 
time at which the condition was detected. The trap object 
also contains a field 803 indicating the network address, in 
this instance, the internet protocol (IP) address, of the device 
reporting the trap. This allows the determination of faults
with particular device(s) in the network. Lastly, the trap 
object contains variable-value pairs (var-binds) in field 804 
which specifies, in more detail, the specific device(s), ports, 
attachments or other pertinent data for the trap in order to 
determine at correlation time whether the rule has been 
matched or not. 

When traps are received by the correlation engine 410,
they are converted to a raw trap record 900 shown in FIG.
9. 900 of FIG. 9 is an RAL object instance of the trap 800
discussed with reference to FIG. 8, with the addition of several
fields 905-908 which are used for control of the correlation
process to determine whether the particular trap has been
correlated or not, the number of occurrences of the event,
etc. Count field 905 contains a counter of occurrences of
the event of the given type. "Visited" flag field 906 indicates 
whether this trap has been examined by the correlator. 
"Fired" flag field 907 indicates whether the trap record has 
been used to generate a meta trap which was transmitted to 
the trap server 324 in the NMS 320. Lastly, in field 908, the
time of the last of the events within the "event threshold" 
time period is recorded. 

Fault rules are specified by a network manager or other
user by defining the trap type, the number of occurrences
and trap interval for creation of a meta trap object. These are
defined in a text file and maintained in a data structure
accessible to the correlation engine 410 at run-time. The data
structure of fault correlation rule records has the format
illustrated in FIG. 10 and may be represented internally in
the correlator using well-known programming techniques.

1000 of FIG. 10 shows an example record which may be
used for maintaining a single fault correlation rule in
implemented embodiments of the present invention. The record
will include a first field 1001 which is a short string
containing a short name of the rule. This field is followed by
the rule number field 1002 which is used for associating
received traps with the defined rules. The rule number is an
integer which is used for matching traps to defined rules. The
next field is the problem type field 1003 which is a long
description of the type of fault which has been identified.
The following field is a problem description field 1004,
which is a long English-language description of the problem
identified by the rule. The next field 1005 is a severity field
which is an integer value indicating the relative severity of
the rule, if it is determined to be correlated, which ranges
from 0-10 wherein 0 is an operating condition and 10 is a
failure.

Event threshold field 1006 contains an integer value
specifying the number of occurrences of the trap which
must occur before the rule will be considered to be matched
and thus a fault record or meta trap is generated. The next
field in the record is the time interval field 1007 which is an
integer value specifying a time, in seconds, in which the
threshold number of traps specified in field 1006 must be met
in order to match the rule. The next field is the age time field
1008 which is used for aging out traps. The RAL object instance
is removed after the "age time" threshold specified in this
field has expired. The escalation threshold field 1009 is an
integer value representing the number of new traps which must
be received within the time interval, after a rule has been
matched, in order to generate another meta trap known
as an "escalation" trap indicating that the problem severity
should be increased. Again, this is an integer value indicating
the number of events which must occur after the initial
firing of a rule. "Rule Active" field 1010 is for specifying
whether the rule is active or not. It is a Boolean value
specifying whether, for this given session of the correlator,
the rule should be activated or not. Finally, the fields
1011 and 1012 are the problem cause and problem solution
fields which are text fields containing probable problem
causes for the fault and problem solutions in text form.
These are later used by the fault user interface for displaying
the current status of the network and possible problems
along with their associated solutions.

An example of fault correlation rule for use in an 
Ethernet-type network is shown below. This uses the format 
as specified in the record 1000 shown in FIG. 10 discussed 
above: 

Type: LocBridge OperChnge

#Severity            37   4
#EVENT_THRESH        37   2
#TIME_INTERVAL       37   300
#AGE_TIME            37   3600
#ESCALATION_THRESH   37   2
#RULE_ACTIVE         37   1
#PROBLEM_CAUSE       37   The local bridge's spanning tree algorithm
                          has resulted in changing its state between
                          active and standby.
#PROBLEM_SOLN        37   Check if the state change was due to a
                          network failure requiring a topology change.

Note that this is shown for example only, and that many such
rules may be defined for any variety of devices and/or
networks, according to implementation. Note that any or all
of the rule parameters may be modified in an
implementation-dependent manner. This may include
manual editing by the user of configuration files containing
rule definitions, or the issuance of directives to a computer
program (e.g. resident in the SA Host 310), such as a
configuration manager, which can create, modify, or delete
any and/or all of the rule parameters, or rules.
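
As an illustration of how rule definitions of this kind might be read from a text file, the following sketch parses entries of the form keyword, rule number, value. The exact file syntax is not specified here, so the one-entry-per-line layout, the treatment of unmarked lines as continuations of the previous text value, and the function name are all assumptions.

```python
def parse_rules(text):
    """Parse '#KEYWORD <rule-number> <value>' entries into {rule_number: {KEYWORD: value}}.

    The line layout is assumed for illustration; it is not a documented file format.
    """
    rules = {}
    last = None  # (rule dict, keyword) of the most recent entry, for continuation lines
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Type:"):
            continue
        if line.startswith("#"):
            keyword, number, value = line[1:].split(None, 2)
            rule = rules.setdefault(int(number), {})
            keyword = keyword.upper()
            rule[keyword] = int(value) if value.isdigit() else value
            last = (rule, keyword)
        elif last is not None:
            rule, keyword = last
            rule[keyword] = f"{rule[keyword]} {line}"  # append wrapped text
    return rules
```

Keying the result by rule number mirrors the way received traps are associated with rules through the rule number field 1002.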

When a fault condition is triggered by the matching of a
predefined rule as discussed above, that is, when a specified
number of traps of a specified type have occurred within a
given time interval, a meta trap object is created and sent to
fault recorder 323. A meta trap object is illustrated with
reference to 1100 of FIG. 11. The meta trap object contains
the following fields: A problem type field 1101 is a name 
identifying the problem; a description field 1102 which is a 
long English language description of the identified problem; 
the unique identifier of the device 1103 which contains a 
name, address, or other unique identifier for the device; an 
agent address field 1104 which is used for identifying the 
address of the agent reporting the event; a fault category 
field 1105 for specifying the rule number of the fault; a 
severity field 1106 indicating the severity of the problem 
which has been detected, typically represented as an integer 
between 0 and 10; a vendor field 1107 indicating the vendor 
name of the affected device; a probable cause field 1108 
which contains text indicating a probable cause for the 
reported problem; a possible solution to the problem 1109 
which is a text field indicating possible solutions to the 
identified problem; a counter field 1110 indicating the num- 
ber of correlated traps to generate this meta trap; a correlator 
identifier field 1111 indicating a name, address or other 
unique identifier of the correlator generating the meta trap 
object; a meta trap ID field 1112 — a unique identifier for the 
meta trap object being generated; and finally, a correlation 
field 1113, in one embodiment represented as an integer 
value, indicating whether the meta trap is a "normal," 
"escalation," or "summary" trap. 

RAL objects such as 900 are not deleted after a rule has 
been matched. The counting and evaluation of raw trap 
objects continues until expiration times for each rule have 
been reached. At those times, the RAL trap object instance(s)
are deleted. After a rule has been matched and additional traps
of the same type are received, a special rule is invoked
to determine whether a specified number of new traps were
received since the previous meta trap was sent. If this
threshold is exceeded, then another meta trap object having
the same format as 1100 of FIG. 11 is generated which is
known as an "escalation" trap. The escalation trap has the
same format, except that the trap count
field is increased, and the severity level is increased by two,
indicating that the problem is more serious than previously
reported. This again may assist the user in determining
whether a more serious condition is present in the affected
device(s). In order for an escalation trap to be generated, an
escalation number of traps, as specified in field 1009, should
be received since the initial generation of a meta trap. Meta
traps which only update the count of correlated traps field
1110 are referred to as summary traps. Summary traps are
generated every "age-time" interval since the first raw trap
for the rule arrived, provided that additional traps arrived
between the last meta trap that was sent and the end of the
"age-time" interval specified in field 1008 of the fault rule
for the type of event. As discussed above, normal,
escalation, and summary meta traps are all distinguished by
unique integer values in field 1113 of the meta trap object.

FIG. 12 illustrates graphically, for any given reduction
rule, the relationship between the generation of an initial fault
report, the generation of an escalation report, and the
generation of a summary report caused by specified numbers of
traps within a given time interval. Thus, in the illustration
shown as 850 of FIG. 8b, for the rule illustrated, a fault report
is generated on five traps within a 30 minute (1800
second) time interval. Upon five traps being detected, a fault
report will be generated with the severity level 6. Upon an
additional two traps being detected after the first meta trap was
sent, an escalation report will then be generated with the
severity level increased by two, that is, severity level 8.
Finally, a summary report is generated when the expiration
time of the first raw trap occurs.
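
The sequence of normal, escalation, and summary reports in this example can be traced with a small sketch. The class below is hypothetical: it counts traps for one matched rule and ignores the time interval for brevity. It also assumes that thresholds are met with "greater than or equal", that an escalation reports the rule severity plus two (capped at 10), and that a summary reuses the most recently reported severity; none of these details is fixed by the description.

```python
NORMAL, ESCALATION, SUMMARY = "normal", "escalation", "summary"

class RuleInstance:
    """Per-rule trap counter deciding which kind of meta trap to emit (illustrative)."""

    def __init__(self, severity, event_thresh, escalation_thresh):
        self.severity = severity
        self.event_thresh = event_thresh
        self.escalation_thresh = escalation_thresh
        self.count = 0
        self.count_at_last_report = None  # None until the rule first fires
        self.last_severity = severity

    def on_trap(self):
        """Count one trap; return (kind, severity, count) if a meta trap is due."""
        self.count += 1
        if self.count_at_last_report is None:
            if self.count >= self.event_thresh:
                self.count_at_last_report = self.count
                return (NORMAL, self.severity, self.count)
        elif self.count - self.count_at_last_report >= self.escalation_thresh:
            self.count_at_last_report = self.count
            self.last_severity = min(10, self.severity + 2)  # assumed cap at 10
            return (ESCALATION, self.last_severity, self.count)
        return None

    def on_age_out(self):
        """At age-time expiry, emit a summary only if the count changed since the last report."""
        if self.count_at_last_report is not None and self.count > self.count_at_last_report:
            return (SUMMARY, self.last_severity, self.count)
        return None
```

With `RuleInstance(6, 5, 2)`, the fifth trap yields a normal report at severity 6 and the seventh an escalation report at severity 8, matching the figures quoted above.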

Toggle rules are defined in the same way as standard rules;
however, matching events (e.g. "link-up" events
and "link-down" events) are defined by the same rule.
Toggle rule traps are treated in a slightly different way than
standard traps. The event threshold and escalation threshold
fields are ignored. RAL object instances are generated upon
each reception of a toggle rule trap which changes from a
previous state of the device. These are defined in a table by
rule number which is separately specified in embodiments of
the present invention. The age time field is used for toggle
rule traps wherein if another trap does not occur within the
time interval specified in the age time field then the RAL
instance of the trap is disposed of. Treatment of toggle rules
is otherwise the same as for reduction rules.

The operation of "toggle" rules is graphically illustrated
with reference to FIG. 13. In contrast to reduction rules,
toggle rules generate only a single type of meta trap
upon a state change. No escalation or summary reports are
generated. As previously discussed, toggle rules map to at
least two events, wherein each causes the affected object(s)
to change state. In implemented embodiments, most
toggle rules initially fire upon a change in state of the object
from some default condition (e.g. link "up") to some other
condition which is the opposite of the default (e.g. link
"down"). As illustrated in the diagram 1300 in FIG. 13, the
change in state to the "down" condition results in the
correlator polling, at regular intervals which are user-defined,
the affected network object(s) in order to determine
whether the state has resumed its default. Thus, as
illustrated, an initial trap is received indicating that the link
is "down" and a fault report is generated. Subsequent
thereto, SNMP Gets are generated at some regular interval
wherein no additional meta traps are generated until a
subsequent state change occurs. In the illustrated example, a
"cleared" fault report is generated when the link resumes a
state of "up" and no additional SNMP Gets nor fault reports
are generated, that is, until a change in state is again
detected, as illustrated in FIG. 13. Any number of types of
events may be collapsed into a single toggle rule allowing
this "toggling" between a first and second state (e.g. "up"
and "down").

FIGS. 14a and 14b illustrate a process which is performed
upon detection of a trap by trap server 312 as passed to fault 
correlator 313. As previously discussed, upon detection of 
traps, the correlator will determine whether a given number 
of traps within a specified type have met the rule criteria 
within the time threshold specified by the rule. 

Process 1400 of FIGS. 14a and 14b is a detailed illustra- 
tion of the process steps which are performed upon the 
detection of each trap from devices in the network. First, at 
step 1402, a RAL object instance of the trap is created. The 
list of recorded traps may then be examined at step 1404 to 
determine the matching fault correlation rules. If the match- 
ing fault is a "toggle" rule as determined at step 1405, then 
it is determined whether the new trap has caused a change 
in state for the device at step 1406. If so, then a meta-trap is 
generated at step 1407 reflecting this change. If, however, 
the new trap relates to a standard reduction rule (it is not a 
"toggle" rule), then it is determined whether the rule criteria 
for the new trap have been met at step 1408. This may be 
determined by performing a test to see whether a threshold 
number of recorded traps of the given type have occurred 
within the threshold time interval. A more detailed view of 
step 1408 is shown in FIG. 15. If the rule criteria have been 
met, as detected at step 1408, then it is determined at step 
1409 whether a previous meta trap has been generated for 
the particular rule. If so, then it is determined whether the 
escalation threshold has been met at step 1411. If so, then an 
escalation trap is generated at step 1412, since the number of
traps has exceeded that required for generating an
"escalation"-type meta-trap. Process 1400 then continues on
FIG. 14b.

If a previous meta trap had not been generated, as detected 
at step 1409, then an initial meta trap is generated and sent 
to the NMS 320 at step 1410. Upon generation of either an 
escalation or meta trap, RAL object instances may be purged 
from the meta trap queue, according to their respective 
age-out time. This is shown in FIG. 14b. Process 1400
proceeds to FIG. 14b wherein at steps 1416-1422, meta 
traps are aged-out. That is, at step 1416 the age-out time for 
each meta trap instance is examined (according to its type) 
from the last meta trap generated. It is determined whether 
the last meta trap generated was greater than or equal to the 
age-out time specified in the rule. If so, then, at step 1417, 
it is determined whether the trap count for the meta trap has 
changed since the last generation of the meta trap. If so, then 
a "summary" meta trap is generated at step 1418. If not, or 
on the completion of step 1418, the meta trap RAL object 
instance is deleted at step 1419. It is then detected at step 
1420 whether any more traps need to be examined. If so, 
then the next recorded trap at step 1422 is obtained, and 
steps 1416 through 1420 are again repeated until there are no 
more unexamined meta traps in the queue. In this way, meta 
traps are aged-out and removed from the queue.

FIG. 15 illustrates a more detailed sequence of process 
steps which may be performed at step 1408 in FIG. 14a to 
determine whether the rule criteria have been met. In imple- 
mented embodiments, the matcher in the RAL engine is 
used. However, in other embodiments, a sequence of steps 
as illustrated may be performed. At step 1502, the recorded 
traps are examined. At step 1506 it is determined whether 
the trap being examined is of the same type and within the 
specified time threshold for the rule for the recorded trap 
being examined. In order for the specified type condition to 
be met, the correlator examines the var-bind pairs of the trap
in order to determine the manner in which the rule is applied.
For example, if two events occur which have the same rule
number but affect different devices, and the rules are only
relevant if the events occur to the same devices (e.g. same
range of IP addresses), then they will be treated as separate
types of events, and will not be correlated together. The
application of the var-bind pairs to make these distinctions
is hard-coded into the underlying program; however, var-bind
pairs in the rules themselves are user-configurable,
according to implementation. Different applications of the
var-bind pairs may include, but not be limited to, specification
of certain port numbers, MAC (media access control)
addresses, current attachment of device(s), or other information
contained within the var-bind pairs. If the examined
trap is of the current type and has occurred within the time
threshold, then at step 1507, the RAL instance for the trap
is disposed of. This is because the current trap is then
recorded into the recorded trap being examined. The occurrences
counter of the recorded trap is then incremented at
step 1508. Then, step 1508 proceeds to step 1514.

If there was not a match of the type of trap and time
threshold, as detected at step 1506, then step 1510 is
performed to determine whether there are any more recorded
traps to be examined. If not, then 1510 proceeds to 1514. If
so, however, at step 1512 the next recorded trap is retrieved
and process steps 1506 through 1510 repeat. Once all
recorded traps in the trap queue have been examined at step
1510, then it is determined whether the occurrences counter
has exceeded the events threshold for the trap. If so, then the
process returns at step 1516 with a flag indicating that the
"criteria have been met". If not, however, then a
flag indicating that the criteria have not been met may be
passed back to the calling routine at step 1518.

Process 1600 shows a process which is performed at
regular intervals, according to implementation, and which is
used for determining whether summary traps should be
generated. In effect, this process handles the additional
circumstance where the trap count has changed, but a
standard meta or escalation trap did not necessarily get
generated. In this instance, a summary trap is generated.
Thus, process 1600 commences at a typical process entry
point, and it is determined at step 1602 whether there are any
additional RAL object instances which have not been examined.
If so, the process continues, and the RAL object
instance for the meta trap is examined at step 1604. If the
age-time has not expired for the examined RAL instance, as
determined at step 1606, then the process continues and
returns to step 1602. If it has expired, however, then the process
proceeds to step 1608 to determine whether the trap count
has been incremented since the last meta trap was generated
(i.e. no additional meta or escalation trap was generated). If
so, then a summary trap is generated. If not, then the RAL
object instance is deleted at step 1609. The process continues
at step 1602 wherein if no unexamined RAL objects
remain, then the process is complete. This process is
performed by the correlator independent of whether traps are
received or not in order that summary traps may be generated
when trap counts change, even if no additional problems
(e.g. escalations) have been identified.
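
The periodic sweep of process 1600 can be sketched as a single pass over the tracked instances. The dictionary keys and the function name are hypothetical. The sketch also assumes that an instance for which a summary is generated is retained and its age window restarted, which the description leaves open.

```python
def sweep(instances, now):
    """One pass of the periodic summary check; returns (summaries, remaining instances).

    Each instance is a dict with hypothetical keys: 'rule', 'first_seen',
    'age_time', 'count', and 'count_at_last_report'.
    """
    summaries, remaining = [], []
    for inst in instances:  # steps 1602/1604: examine each instance in turn
        if now - inst["first_seen"] < inst["age_time"]:  # step 1606: not yet expired
            remaining.append(inst)
        elif inst["count"] > inst["count_at_last_report"]:  # step 1608: count changed
            summaries.append({"rule": inst["rule"], "count": inst["count"]})
            inst["count_at_last_report"] = inst["count"]
            inst["first_seen"] = now  # assumed: restart the age window
            remaining.append(inst)
        # Otherwise the instance is dropped (step 1609).
    return summaries, remaining
```

Because the sweep runs on a timer rather than on trap arrival, a rule whose count rose by fewer traps than the escalation threshold still produces a summary once its age time expires.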

In addition to generating fault reports (as meta traps) to
the network management station, the correlation engine also
generates state updates which are transmitted to the state
engine shown in FIG. 4. State updates are generated according
to state rules which are generated according to the state
diagram discussed with reference to FIG. 6 above. State
rules are associated with traps which may be generated, and
are stored according to rule number of the traps which are
detected. An example of a state rule is illustrated with
reference to FIG. 17. Like a reduction or toggle rule, each
state rule is referenced by a rule number 1701 which
corresponds with rule numbers for each of the traps as
previously discussed. Thus, an event of a given type may
also result in a state change occurring, as detected by
correlator 313 and received by the NMS for the monitoring
of network health. The next field is a state category field
1702 which is an integer value specifying whether the state
rule refers to a connectivity, error rate, components
condition, load, configuration, or security state rule. The next
field is severity level field 1703 which is an integer value
between 0 and 10 referring to the severity level of the state
change. An update mode field 1704 is an integer value
specifying whether the state change is either a "set" or
"increment" state change, specifying how the given
severity level should be used to compute the new severity
level, as discussed with reference to the state diagram in
FIG. 6, above. A maximum limit field 1705 specifies
the maximum limit for the computed severity level which, as
discussed with reference to the state diagram, is relevant
only in the case of an "increment" state update. Finally, the
state rules contain an aging interval field 1706 which is an
integer value specified in seconds after which the effect of
the state update should be undone. This is again applicable
only for "increment" types of updates. Examples of state
update rules are shown below:

For example, the state rule for an NMM Saturation trap
can be:

State Category=Load
Severity Level=1
Update Mode=Increment
Maximum Limit=6
Aging Interval=1800

And a state rule for a Power Supply Failure trap can be:

statistics are reset every time the correlator is reset. The
correlator statistics may be reset by an SNMP request from
the fault summary process 321, or may be reset by the
issuance of a signal from another device capable of
communicating with the fault correlator 313. The counters 1800
are useful for obtaining statistics about a given correlator.
These statistics may be displayed by a user in order to
determine the current performance of the network, and the
performance of the particular correlator. This information
may be used for generating the display window 2200 of FIG.
22.

FIG. 19 illustrates an example user interface display
window 1900 which may show to a network manager or
other user the current status of faults in the fault data base.
For each fault report, certain information may be displayed.
For example, for any given fault report generated, the status
1902 of the fault may be displayed. This may be set
by the user as being "new," "assigned," "fixed," "rejected,"
or "closed," although all faults are assigned "new" by
default. The severity level is displayed in the field 1904. The
type of the fault is displayed in field 1906, and the internet
protocol address (or other unique identities) 1908 for the
device detecting the trap is displayed. The time in which the
initial fault report was created is displayed in field 1910, and
the number of trap occurrences that generate the fault are
displayed in field 1911. Other options are presented to the
user in fields 1914 through 1918. 1914 allows the user to
refresh the currently displayed recorded list of fault reports
(meta traps). This is done by reading in the faults report data
base once again and displaying any new faults which have
been generated since the last time the display 1900 was
updated. Properties icon 1916 may be selected by the user
for specifying display options, such
as various filtering, summaries, and other information which
may be of interest to the network manager or other user. The
user may select statistics icon 1918 for retrieving correlation
statistics for a particular correlator. Also, the "bell" icon

State Category-Load 1919 indicates that there have been new faults since the last 

Severity Level=9 refresh of the correlator. Note that various icons upon the 

Update Mode=Set 40 display window are color-coded in addition to the text 

Any number of types of state change rules may be defined information displayed in order to provide additional infor- 

and used to display a color-coded representation of network mation to the user. For example, the severity level ranges 

health at various levels of abstraction (port, chassis, hub, from one to ten and different colors may be assigned to each 

etc. , .) on a user interface display, as will be described in of the severity levels according to implementation and user 

more detail below. As state updates are sent by the correla- 45 specification. For example, using the properties control 

tion engine to the state engine, and the state change is then panel to be described below the critical severity level is 

sent to the network management station, these state changes indicated by a red color, an orange color is used for 

are recorded in the network database 327, and can be indicating a medium severity level and yellow indicates a 

displayed by the network health monitor 328 to the user or low severity level of the fault. One of two icons is displayed 

network manager. 50 for each fault report at the far self portion 1901 of the display 

Statistics about each correlator are accessible by SNMP 1900. A separate "bell" icon (e.g., 1901a) is displayed for 

requests from the fault summary process 321. These are each new and unreviewed fault report. A "note" icon (e.g., 

illustrated and discussed with reference to FIG. 18. 1800 of 19016) is displayed for a fault for which comments have 

FIG. 18 shows an object which is accessible by SNMP been added by the user. 

requests in the correlator which indicates the status of 55 The properties user interface is shown with reference to 

operation of the correlator. A field 1801 maintains a record 2000 of FIG. 20. The user may specify various parameters 

of the date and time since the last reset of the correlator. of the items displayed in window 1900 of FIG. 19. The 

Thus, all traps which are currently stored in the correlator severity attribute sliders 2002 and 2004 of FIG. 20 may be 

have been stored since this previous date and time. A counter used for specifying the range of the different severity levels. 

1802 is maintained of the number of correlated traps. In 60 The use of the sliders has no effect on the rules, but rather, 

addition, another field 1803 maintains a count of the number defines the severity levels for the purposes of display. The 

of uncorrelated traps. Field 1804 indicates the number of medium value specified by 2004 must be lower than the 

meta traps generated by the correlator. Finally, a distribution critical value specified by slider 2002. The default is one 

or count of the different severities of meta traps generated is through three for low severity level, four through six for a 

maintained in a last field 1805 of the record such as a 65 medium severity level, and seven through ten for a critical 

separate set of integer values for each range of severities of severity level. The setting number is used as the floor for 

meta traps (from 1 to 10) generated. All the counters in the each rank. 
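
The floor rule described for sliders 2002 and 2004 can be pictured with the following minimal sketch. It is an illustration only, not the disclosed implementation: the class name SeverityDisplay, its field names, and the rank and color methods are hypothetical, and the colors simply follow the red, orange, and yellow example given for window 1900. The defaults of 4 and 7 correspond to the default ranges stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityDisplay:
    """Display-only ranking of severity levels (hypothetical names).

    Each setting is the floor (lowest severity) of its rank, so the
    defaults give 1-3 low, 4-6 medium, and 7-10 critical.
    """
    medium_floor: int = 4
    critical_floor: int = 7

    def __post_init__(self):
        # The medium setting must be lower than the critical setting.
        if not 1 <= self.medium_floor < self.critical_floor <= 10:
            raise ValueError("need 1 <= medium_floor < critical_floor <= 10")

    def rank(self, severity: int) -> str:
        """Return the display rank for a severity level from 1 to 10."""
        if not 1 <= severity <= 10:
            raise ValueError("severity must be between 1 and 10")
        if severity >= self.critical_floor:
            return "critical"
        if severity >= self.medium_floor:
            return "medium"
        return "low"

    def color(self, severity: int) -> str:
        """Map the rank to the example colors used in window 1900."""
        colors = {"critical": "red", "medium": "orange", "low": "yellow"}
        return colors[self.rank(severity)]
```

Because the ranking is computed only when a report is displayed, changing the floors in such a sketch leaves the stored severity levels and the rules untouched, which matches the statement that the sliders have no effect on the rules.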
The filter items 2006 to 2010 are used for filtering out various
information which is not of interest to the network manager. For
example, the filter by device field 2006 may be used for filtering out
any fault reports that are not related to the specified device. The
device can be specified by any unique identifier, such as its name or
network address. If no device is specified then all reports for all
devices are displayed in window 1900 of FIG. 19.

The user may also filter by severity and by status of the faults.
Thus, by setting any of radio buttons 2008 or 2010, various reports
may not be displayed to the user in window 2000 according to
particular requirements of the manager at the time. The user may also
specify a maximum number of fault reports for display on the list. The
default for the maximum number is 200; however, the user may adjust
this value by selecting slider 2012 and setting it to the appropriate
value. Finally, the user may also specify that any reports older than
a specified number of days may be deleted from the database as shown
in region 2014.

FIG. 21 illustrates a fault detail window which may be displayed upon
selection of a specific fault from the fault list 2100 of FIG. 21,
and/or the selection of an appropriate pull-down menu on the network
management station console. This may be used by the network manager
for obtaining a detail of a fault, any probable causes to the fault,
and any possible solutions to the fault. This is displayed in 2100 of
FIG. 21. For example, in area 2102, the details shown in FIG. 19 are
displayed to the user to indicate the severity level, problem type,
unique identifier, date and time of the fault, correlator, and vendor
of the device reporting the fault. This information is obtained from
the information stored in meta trap objects such as 1100 of FIG. 11.
Then the detailed description of the problem obtained from the defined
fault rule (field 1102 of FIG. 11) may be displayed in field 2104 to
provide such information to the network manager. Then any probable
causes to the fault are displayed in the sub window 2106, and any
proposed solutions to the problem are displayed in sub window 2108.
Field 2110 is used for displaying to the user or network manager, and
allowing the modification of, a particular individual responsible for
servicing the fault. If correction of the fault is assigned to an
individual, then this is displayed in field 2110. Field 2112 is a
selector icon which is used by the network manager for specifying the
current status of the fault. The default is "new"; however, the
manager can select "assigned," "closed," or similar status when he has
taken action on the fault in order that he indicate his handling of
the particular fault. Field 2114 is used for displaying and/or
modifying any free text desired to be associated with the particular
fault. This may include special instructions to the user or network
manager.

Window 2200 of FIG. 22 allows the user to select the correlator host
for which he wants detailed statistics. The selected correlator host's
name is displayed in window 2222, and correlator hosts may be added,
deleted, or changed by the user selecting 2230 through 2234 and
entering an IP address, name or other unique identifier in field 2222.
Detailed statistics are displayed on the right side of the window in
areas 2210 and 2224. Area 2210 is for displaying in graphical form a
detail of the distribution of severity levels of the faults currently
being displayed. Area 2224 is for displaying the total statistics for
the fault correlator hosts currently detected by examining the records
(e.g. 1800 of FIG. 18), for each correlator using SNMP requests.

Two example displays which may be displayed by the network management
station are shown in FIGS. 23 and 24. As illustrated in FIG. 23, the
network health monitor window 2300 may show iconic representation of
network devices which are color-coded according to their aggregated
state (which is determined based upon the schemes discussed with
reference to FIGS. 6-7b). The colors which are used may be mapped to
the same color codes as discussed with reference to the fault summary
windows. Moreover, identifying icons such as the bridge icons, or
router icons may be displayed indicating the type of device for which
the state summary information is displayed and is displayed at each
level in the networking hierarchy. For example, for the network as a
[illegible] physical buildings in which the network is installed may
be separately displayed in row 2310. Switches of different types may
be displayed with different iconic representations in row 2320. Router
health is displayed in row 2330, bridge health is displayed in row
2340, and hub I, hub II health, respectively, according to the
particular network's hierarchy, is displayed in rows 2350 and 2360.

Network device identifying information may also be displayed along
with the color-coded device. Finally, as discussed with reference to
the fault detail summaries above, various filtering of the different
state information according to either device or severity level may
also be displayed according to user requirements. These are done in a
manner well-known to those skilled in the prior user interface arts.

Yet another type of window which may be displayed by the network
health monitor is a state detail which shows a distribution of
severity levels of different categories of states for a particular
network object. One example of [illegible] one of the devices
[illegible] in display 2300 of FIG. 23. Using a window such as 2400 of
FIG. 24, the network manager can then view the current severity levels
of each of the different types of state categories for the [illegible]
device. In this way, the health of the network may be monitored and
the information is presented in a most convenient way to the network
manager.

Thus, using the foregoing techniques, a plurality of correlators and a
networking system may be used to generate detailed [illegible] for
other devices in the [illegible], which [illegible] a centralized
fault data base for access by a user or a network manager, as well as
monitoring the current state of the devices. Note that the foregoing
has particular advantages in very large networking systems having a
large number of stations, devices, or other network resources. Note
that the foregoing has particular [illegible], and although it has
been described with reference to certain specific embodiments in the
figures and the text, one may practice the present invention without
[illegible] of the specific details. Thus, the [illegible] are to be
viewed in an illustrative sense only, and not limit the present
invention. The present invention is to be limited only by the appended
claims which follow.

What is claimed is:

1. A method of indicating a fault within a network, the
method including the computer-implemented steps of:

detecting the occurrence of an event within the network;

identifying the event as being a first type of event;

determining whether a threshold number of events
identified as being the first type of event have occurred
within a first predetermined time period;

if the threshold number of events have occurred within the
first predetermined time period, then indicating a fault
within the network;

detecting the occurrence of a further event within the
network after the step of indicating the fault;
identifying the further event as being the first type of
event; 

determining whether an escalation threshold number of 
events identified as being the first type of event have 
occurred within a second predetermined time period
and after the step of indicating the fault; and 

if the escalation threshold number of events have occurred 
within the second predetermined time period then indi- 
cating an escalation in a severity level of the fault. 

2. The method of claim 1 wherein the step of indicating
the fault comprises generating a fault report incorporating a 
fault identifier and identifying the severity level of the fault. 

3. The method of claim 2 including the step of identifying 
a cause of the fault and a proposed solution to the fault in the 
fault report.

4. The method of claim 1 wherein the step of indicating 
the fault comprises providing a graphical representation of 
the fault and the severity level of the fault. 

5. The method of claim 1 wherein the step of indicating 
the escalation in the severity level of the fault comprises
generating a fault escalation report incorporating an 
increased severity level of the fault. 

6. The method of claim 1 including the step of indicating 
the total number of events identified as being the first type 
within an expiration time period.

7. The method of claim 1 including the steps of:

determining whether the event indicates a change in the
state of a specific network device; and

if the event indicates a change in the state of the specific
network device, propagating information regarding the
change in the state of the specific network device to a
further network device affected by the change in state.

8. The method of claim 1 including the steps of: 

in response to the step of identifying the event as being the
first type of event, indicating a first network device as 
being in a first state; 

detecting the occurrence of a further event within the 
network; 

determining whether the further event is a second type of
event; and 

indicating the first network device as being in a second 
state, if the further event is of the second type of event. 

9. The method of claim 1 including the step of removing
the indication of the fault within the network after a
predetermined age period.

10. The method of claim 1 wherein the step of detecting 
the occurrence comprises the step of receiving a trap from a 
network device. 

11. The method of claim 1 wherein the step of identifying 
the event comprises the step of determining whether the 
event occurred at a specific network device. 

12. Apparatus for indicating a fault within a network, the
apparatus comprising:

an identification circuit to identify a network event as
being a first type of event;

a counter to maintain a count of network events identified
by the identification circuit as being the first type of
event;

a comparator to determine when the count of network
events within a first predetermined time period equals
or transcends a threshold value; and

an indicator to indicate a fault within the network when
the count of network events within the first predetermined
time period equals or transcends the threshold value;

wherein the counter maintains a count of further network
events which occur after the indicator has indicated a
fault, and which are identified by the identification
circuit as being the first type of event;

wherein the comparator determines when the count of
further network events, within a second predetermined
time period, equals or transcends an escalation threshold
value; and

wherein the indicator indicates an escalation fault
identifying an increased severity level of the fault when the
count of further network events, within the second
predetermined time period, equals or transcends the
escalation threshold value.

13. The apparatus of claim 12 including a detection circuit 
to detect the occurrence of a network event. 

14. The apparatus of claim 12 wherein the indicator 
comprises a report generator which generates a fault report 
incorporating a fault identifier and identifying the severity 
level of the fault. 

15. The apparatus of claim 14 wherein the report genera- 
tor identifies a cause of the fault and a proposed solution to 
the fault in the fault report. 

16. The apparatus of claim 12 wherein the indicator
indicates the total number of events identified as being of the
first type within an expiration time period.

17. The apparatus of claim 12 including a state change
circuit that determines whether the network event indicates
a change in the state of a specific network device, and that
propagates information regarding the change in the state of
the specific network device to a further network device
affected by the change in the state.

18. A method of indicating a fault within a network, the 
method including the computer-implemented steps of: 

detecting the occurrence of an event within the network;

identifying the event as being a first type of event;

determining whether a threshold number of events
identified as being the first type of event have occurred
within a predetermined time period;

if the threshold number of events have occurred within the
predetermined time period, then indicating a fault
within the network;

in response to the step of identifying the event as being the
first type of event, indicating a first network device as
being in a first state;

detecting the occurrence of a further event within the
network;

determining whether the further event is a second type of 
event; and 

indicating the first network device as being in a second 
state, if the further event is of the second type of event. 

19. An apparatus for indicating a fault within a network,
the apparatus comprising:

an identification circuit to identify a network event as
being a first type of event;

a counter to maintain a count of network events identified
by the identification circuit as being the first type of
event;

a comparator to determine when the count of network
events within a predetermined time period equals or
transcends a threshold value; and

an indicator to indicate a fault within the network when
the count of network events within the predetermined
time period equals or transcends the threshold value;

wherein the indicator indicates a first network device as
being in a first state when the network event is
identified as being the first type of event;

wherein the identification circuit identifies a further event
within the network as being a second type of event; and

wherein the indicator indicates the first network device as
being in a second state, if the further event is identified
as being the second type of event.

* * * * *
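
The counting logic recited in claims 1 and 12 (a threshold within a first time period, then an escalation threshold within a second time period) can be pictured with the following minimal sketch. It is an illustration only and assumes several details the claims leave open: a sliding time window, timestamps in seconds supplied in non-decreasing order, a single event type per correlator instance, and a count of further events that restarts after each escalation. The class name Correlator and the method on_event are hypothetical.

```python
from collections import deque


class Correlator:
    """Threshold and escalation counting for one event type (hypothetical)."""

    def __init__(self, threshold, period, esc_threshold, esc_period):
        self.threshold = threshold          # events needed to indicate a fault
        self.period = period                # first time period, in seconds
        self.esc_threshold = esc_threshold  # further events needed to escalate
        self.esc_period = esc_period        # second time period, in seconds
        self._events = deque()              # timestamps before the fault
        self._further = deque()             # timestamps after the fault
        self.fault_indicated = False

    @staticmethod
    def _expire(window, now, period):
        # Drop timestamps that fall outside the sliding time window.
        while window and now - window[0] > period:
            window.popleft()

    def on_event(self, now):
        """Record one event; return "fault", "escalation", or None."""
        if not self.fault_indicated:
            self._events.append(now)
            self._expire(self._events, now, self.period)
            if len(self._events) >= self.threshold:
                self.fault_indicated = True
                return "fault"
            return None
        # Only events after the fault indication count toward escalation.
        self._further.append(now)
        self._expire(self._further, now, self.esc_period)
        if len(self._further) >= self.esc_threshold:
            self._further.clear()  # assumption: restart the count
            return "escalation"
        return None
```

For example, with a threshold of 3 events in 10 seconds and an escalation threshold of 2 events in 5 seconds, events at times 0, 4, and 8 indicate a fault on the third event, and two later events at times 20 and 23 indicate an escalation.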